In this lesson, we will learn to stack dataframes on top of each other and side by side.
The dataframes we will stack on top of each other will be the governor’s two 2019 half-yearly data frames. We will do that on Wednesday.
The dataframes we will join side-by-side will be the voters_yancey and vhis_yancey. We will do this today.
We start with two dataframes: * voters_yancey * vhis_yancey
Can you remember how to tell how many columns and rows each has?
- voters_yancey = 16,351 x 71
- vhis_yancey = 79,915 x 15
Before we get started, let’s load the tidyverse packages into our library…
rr library(tidyverse)
(We will be using the “tidy” way of joining tables together, but JOIN functions from tidyverse’s dplyr package are very similar to the merge() function in base R.)
You may also need to load the data:
rr # This is not yet a publicly accessible URL.
#yancey_url<- ://adminliveunc.sharepoint.com/sites/MEJO570Spring2020/Shared%20Documents/Joins%20Lectures/March2325.RData
#download.file(yancey_url, destfile = ./March2325.RData)
#load(./March2325.RData)
load(~/Downloads/March2325.RData)
The first thing we will do is create a dataframe called all_yancey. We will create it by applying the left_join function to the voters_yancey and vhis_yancey dataframes.
rr all_yancey<- left_join(voters_yancey, vhis_yancey)
Joining, by = c(\county_id\, \county_desc\, \voter_reg_num\, \ncid\)
rr all_yancey
You can see that the new dataframe we created has 81,202 rows x 82 columns. (It has all 16,351 voters rows x every time it matches a row in vhis_yancey.)
You might ask yourself, “How does R know that a row in one dataframe is supposed to be joined with a row in another dataframe?”
The join functions will look at the column names that are common to both and make the assumption that these are the columns on which they should be joined. This is a feature that you do not find in the SQL database language that is commonly used to query databases.
It’s a nice feature as long as it works. Here are reasons it might not work:
The columns in the two dataframes have matching names, but they contain different types of information. For example the word “name” is a pretty generic column. But in one dataframe it might be the name of a person and in the other it might be the name of a street. You wouldn’t want to join those.
The two dataframes might have columns in common, but they might be called different things. For example, the names of streets might be called “StName” in one dataframe and “Street_Name” in another. These would not match, but should.
The choice of columns that R makes might not yield uniqueness. In our example here, we just need to join on “ncid” because – allegedly – that number is unique to each voter across the state. But if we were joining a dataframe that had all 100 counties in it and we didn’t have ncid, we would need to join by both county_id as well as voter_reg_num because voter_reg_num may be duplicated from county to county. (Incidentally, in this case we do not need county_desc. It’s a synonym for county_id and just slows down the matching.)
Order of arguments matters
Let’s switch the order in which we pass dataframes to the left_join function.
rr left_join(vhis_yancey, voters_yancey)
Joining, by = c(\county_id\, \county_desc\, \voter_reg_num\, \ncid\)
This time we get about 2,000 fewer rows. We get 79,915 and 82 columns. This is all 79,915 rows from vhis x every time it matches a row in the voters dataframe.
Explicitly stating the columns to join
We can also explicitly tell R which columns to use to join. We do this by adding the “by” argument and setting its value to the name of whatever column we want to use in the join. In this case, by = “ncid”.
rr left_join(vhis_yancey, voters_yancey, by = )
This also results in 79,915 rows, but this time we get 85 columns. So we end up with some duplicate columns because we didn’t use them in the join. Note how they are designated with a “.y” or “.x” that gets appended to the names.
(Note: This data isn’t always so clean. I’ve run this exact same query with older versions of these exact same tables and found that it appeared that voters with the same ncid voted twice in the same election. Always be on the lookout for things like when you’re joining dataframes. Remember: When you see things like that it means only one of two things – a good story or bad data. In either case, interviews with humans are required.)
When can drop those three duplicate columns if we want …
rr all_yancey <- left_join(voters_yancey, vhis_yancey, by = )
all_yancey %>% select(-county_id.y, -county_desc.y, -voter_reg_num.y)
Is there a right_join?
Yes, there is such a thing as a right join, but I just use left_join and change the order of the dataframe arguments.
Full Joins
Full joins return all rows and all columns from both dataframes.
rr full_yancey <- full_join(voters_yancey, vhis_yancey, by=)
full_yancey
This gives us 81,294 rows. This is one row for each row in the voters_yancey dataframe that matches a row in the vhis_yancey dataframe PLUS all of the rows in either dataframe that do not have a match.
As you can see in the next line of code, the order of arguments in full joins does not matter.
rr full_join(vhis_yancey, voters_yancey, by=) r NA
Inner Joins
With inner joins, rows that don’t have a match are just dropped.
rr inner_yancey <- inner_join(vhis_yancey, voters_yancey, by=)
inner_yancey
Like in full_joins the order of arguments does not matter.
rr inner_join(voters_yancey, vhis_yancey, by=)
Practical Application
That’s all cool in theory, but in what practical cases would you want to use one kind of join rather than another.
One example is to find voters who are registered but have never voted. For this we will use a left join. We will first do a left join and then filter to find all rows that have values in the election_lbl column.
Remember, this is saying: “Hey R, give me all the voters from the voters_yancey dataframe. Match those to their voting history anytime there is a row in the vhis_yancey dataframe that has a matching ncid.”
Because you are asking for all voters regardless of whether they have a match in the vhis dataframe, you will get values in any of the ncvhis columns anytime there isn’t a match.
This means that after doing the join, we can find the rows that have values in the voter history dataframe and know that these are voters for whom there is no record of them participatig in an election.
rr all_yancey<- left_join(voters_yancey, vhis_yancey, by = )
all_yancey %>% filter(is.na(election_lbl))
It appears there are 1,379 people in Yancey County who are registered but you have never voted. (Note: Not all of those voters are “Active,” but that’s a conversation for a different day.)
If we switch the order of arguments in the left join, we would get the names of anyone anyone who voted in a Yancey County election who is not currently registered in Yancey County.
rr all_yancey<- left_join(vhis_yancey, voters_yancey, by = )
all_yancey %>% filter(is.na(last_name))
You will see that it appears as if there are 92 voter history records that do not match an ncid of a person who is on the voter registration roles in Yancey County. It would be important to understand that why this is the case. This may be a bookkeeping anamoly or something else.
We might wonder whether these 92 voter history records that do not appear to have a matching person are 92 instances of a particular person, 85 different people in one election, or some combination.
Let’s ask R to show us the rows that are missing a last name… then group those rows by ncid (one group for each unique ncid) … then count the number of rows in each group… then put the result of that counting summary into a column called “unlisted” (although we could call that column anything) … and then show us the results of all that ordered by the counting summary values in the unlisted column in descending order.
rr all_yancey %>% filter(is.na(last_name)) %>% group_by(ncid) %>% summarize(unlisted = n()) %>% arrange(desc(unlisted))
OK, so it’s 38 unique ncid values … 38 “people”. And look at the values in the ncid column. They are different lengths. That seems strange because unique identifiers – like Social Security numbers – are usually the same number of characters.
Sure enough, if we look at https://s3.amazonaws.com/dl.ncsbe.gov/data/layout_ncvoter_ncvhis.txt it says that the ncid column should be 12 characters long. (In reality, most I see are six characters long. But there’s clearly a discrepency here that might help us inquire with the agency about it.)
The next thing we might do is take a look at just one of these ncids for which there is a missing person. Maybe we will see a pattern…
rr all_yancey %>% filter(ncid==39580) %>% arrange(election_lbl)
Interesting. All of these show that the “voted_county_desc” is CABARRUS. That makes me wonder if this weirdness is an artifact of people moving between counties.
I wonder if the voter history records that have missing people all come from certain counties or from certain election dates. Maybe some of the records just haven’t transferred yet.
Let’s make one group for each county, and then sub-groups for each election date (in this cased called “election_lbl”) for each county.
rr all_yancey %>% filter(is.na(last_name)) %>% group_by(voted_county_desc, election_lbl) %>% summarize(unlisted = n()) %>% arrange(desc(unlisted))
Filtering Joins
Before we finish, let’s look at two other types of joins that also have specific practical application to our problem here.
We will now look at semi_join and anti_join. These are called “filtering joins” because they use the second dataframe in the function to filter rows from the first dataframe.
If you want to see only voters that have voted, use “semi_join”. A semi_join is like an inner_join, but instead of giving you one row from the first dataframe EVERY time it matches a row in the second dataframe, this gives you only unique values from the first dataframe. And it only gives you columns from the from the first dataframe. It is just filtering the first dataframe, not really joining it with the second.
rr semi_join(voters_yancey, vhis_yancey, by=)
Note that the number of columns in this result are equal to the number of columns in voters_yancey.
If you want to see only voters who have NEVER voted, we use anti_join. This gives us only rows from the first dataframe that have NO matches in the second dataframe.
rr anti_join(voters_yancey, vhis_yancey, by=)
So now we’ve covered the main ways you would join dataframes side by side. In the next lesson we will look at different ways to put dataframes on top of each other.
---
title: "Class 17: Joining Data Frames"
output: html_notebook
---

---
title: "Class 17: Joining Data Frames"
output: html_notebook
---

In this lesson, we will learn to stack dataframes on top of each other and side by side.

The dataframes we will stack on top of each other will be the governor's two 2019 half-yearly data frames. We will do that on Wednesday.

The dataframes we will join side-by-side will be the voters_yancey and vhis_yancey. We will do this today.

We start with two dataframes:
* voters_yancey 
* vhis_yancey 

Can you remember how to tell how many columns and rows each has?

* voters_yancey = 16,351 x 71
* vhis_yancey = 79,915 x 15


Before we get started, let's load the tidyverse packages into our library...

```{r}
library(tidyverse)
```

(We will be using the "tidy" way of joining tables together, but JOIN functions from tidyverse's dplyr package are very similar to the merge() function in base R.)



You may also need to load the data:
```{r}
# This is not yet a publicly accessible URL. 

#yancey_url<- "https://adminliveunc.sharepoint.com/sites/MEJO570Spring2020/Shared%20Documents/Joins%20Lectures/March2325.RData"

#download.file(yancey_url, destfile = "./March2325.RData")

#load("./March2325.RData")

load("~/Downloads/March2325.RData")


```



The first thing we will do is create a dataframe called all_yancey. We will create it by applying the left_join function to the voters_yancey and vhis_yancey dataframes.


```{r}
all_yancey<- left_join(voters_yancey, vhis_yancey)

all_yancey
```

You can see that the new dataframe we created has 81,202 rows x 82 columns. (It has all 16,351 voters rows x every time it matches a row in vhis_yancey.)

You might ask yourself, "How does R know that a row in one dataframe is supposed to be joined with a row in another dataframe?"

The join functions will look at the column names that are common to both and make the assumption that these are the columns on which they should be joined. This is a feature that you do not find in the SQL database language that is commonly used to query databases.

It's a nice feature as long as it works. Here are reasons it might not work:

* The columns in the two dataframes have matching names, but they contain different types of information. For example the word "name" is a pretty generic column. But in one dataframe it might be the name of a person and in the other it might be the name of a street. You wouldn't want to join those.

* The two dataframes might have columns in common, but they might be called different things. For example, the names of streets might be called "StName" in one dataframe and "Street_Name" in another. These would not match, but should.

* The choice of columns that R makes might not yield uniqueness. In our example here, we just need to join on "ncid" because -- allegedly -- that number is unique to each voter across the state. But if we were joining a dataframe that had all 100 counties in it and we didn't have ncid, we would need to join by both county_id as well as voter_reg_num because voter_reg_num may be duplicated from county to county. (Incidentally, in this case we do not need county_desc. It's a synonym for county_id and just slows down the matching.)


**Order of arguments matters**

Let's switch the order in which we pass dataframes to the left_join function.

```{r}
left_join(vhis_yancey, voters_yancey)
```

This time we get about 2,000 fewer rows. We get 79,915 and 82 columns. This is all 79,915 rows from vhis x every time it matches a row in the voters dataframe.


**Explicitly stating the columns to join** 

We can also explicitly tell R which columns to use to join. We do this by adding the "by" argument and setting its value to the name of whatever column we want to use in the join. In this case, by = "ncid".

```{r}
left_join(vhis_yancey, voters_yancey, by = "ncid")
```

This also results in 79,915 rows, but this time we get 85 columns. So we end up with some duplicate columns because we didn't use them in the join. Note how they are designated with a ".y" or ".x" that gets appended to the names.

(Note: This data isn't always so clean. I've run this exact same query with older versions of these exact same tables and found that it appeared that voters with the same ncid voted twice in the same election. Always be on the lookout for things like when you're joining dataframes. Remember: When you see things like that it means only one of two things -- a good story or bad data. In either case, interviews with humans are required.)


When can drop those three duplicate columns if we want ...
```{r}
all_yancey <- left_join(voters_yancey, vhis_yancey, by = "ncid")

all_yancey %>%
  select(-county_id.y, -county_desc.y, -voter_reg_num.y)
```



**Is there a right_join?** 

Yes, there is such a thing as a right join, but I just use left_join and change the order of the dataframe arguments.



**Full Joins**

Full joins return all rows and all columns from both dataframes. 

```{r}
full_yancey <- full_join(voters_yancey, vhis_yancey, by="ncid")

full_yancey
```

This gives us 81,294 rows. This is one row for each row in the voters_yancey dataframe that matches a row in the vhis_yancey dataframe PLUS all of the rows in either dataframe that do not have a match.

As you can see in the next line of code, the order of arguments in full joins does not matter.


```{r}
full_join(vhis_yancey, voters_yancey, by="ncid")

```



**Inner Joins**

With inner joins, rows that don't have a match are just dropped.

```{r}
inner_yancey <- inner_join(vhis_yancey, voters_yancey, by="ncid")

inner_yancey
```

Like in full_joins the order of arguments does not matter.

```{r}
inner_join(voters_yancey, vhis_yancey,  by="ncid")
```



**Practical Application**

That's all cool in theory, but in what practical cases would you want to use one kind of join rather than another.

One example is to find voters who are registered but have never voted. For this we will use a left join. We will first do a left join and then filter to find all rows that have <NA> values in the election_lbl column.

Remember, this is saying: "Hey R, give me all the voters from the voters_yancey dataframe. Match those to their voting history anytime there is a row in the vhis_yancey dataframe that has a matching ncid."

Because you are asking for all voters regardless of whether they have a match in the vhis dataframe, you will get <NA> values in any of the ncvhis columns anytime there isn't a match.

This means that after doing the join, we can find the rows that have <NA> values in the voter history dataframe and know that these are voters for whom there is no record of them participatig in an election.

```{r}
all_yancey<- left_join(voters_yancey, vhis_yancey, by = "ncid")

all_yancey %>% 
  filter(is.na(election_lbl))
```

It appears there are 1,379 people in Yancey County who are registered but you have never voted. (Note: Not all of those voters are "Active," but that's a conversation for a different day.)


If we switch the order of arguments in the left join, we would get the names of anyone anyone who voted in a Yancey County election who is not currently registered in Yancey County.

```{r}
all_yancey<- left_join(vhis_yancey, voters_yancey,  by = "ncid")

all_yancey %>%
  filter(is.na(last_name))
```

You will see that it appears as if there are 92 voter history records that do not match an ncid of a person who is on the voter registration roles in Yancey County. It would be important to understand that why this is the case. This may be a bookkeeping anamoly or something else.


We might wonder whether these 92 voter history records that do not appear to have a matching person are 92 instances of a particular person, 85 different people in one election, or some combination. 

Let's ask R to show us the rows that are missing a last name... then group those rows by ncid (one group for each unique ncid) ... then count the number of rows in each group... then put the result of that counting summary into a column called "unlisted" (although we could call that column anything) ... and then show us the results of all that ordered by the counting summary values in the unlisted column in descending order.


```{r}
all_yancey %>%
  filter(is.na(last_name)) %>%
  group_by(ncid) %>%
  summarize(unlisted = n()) %>%
  arrange(desc(unlisted))
```

OK, so it's 38 unique ncid values ... 38 "people". And look at the values in the ncid column. They are different lengths. That seems strange because unique identifiers -- like Social Security numbers -- are usually the same number of characters. 

Sure enough, if we look at https://s3.amazonaws.com/dl.ncsbe.gov/data/layout_ncvoter_ncvhis.txt it says that the ncid column should be 12 characters long. (In reality, most I see are six characters long. But there's clearly a discrepency here that might help us inquire with the agency about it.)

The next thing we might do is take a look at just one of these ncids for which there is a missing person. Maybe we will see a pattern... 

```{r}
all_yancey %>%
  filter(ncid=="CG39580") %>%
  arrange(election_lbl)
```

Interesting. All of these show that the "voted_county_desc" is CABARRUS. That makes me wonder if this weirdness is an artifact of people moving between counties. 

I wonder if the voter history records that have missing people all come from certain counties or from certain election dates. Maybe some of the records just haven't transferred yet.

Let's make one group for each county, and then sub-groups for each election date (in this cased called "election_lbl") for each county.

```{r}
all_yancey %>%
  filter(is.na(last_name)) %>%
  group_by(voted_county_desc, election_lbl) %>%
  summarize(unlisted = n()) %>%
  arrange(desc(unlisted))
```



**Filtering Joins**

Before we finish, let's look at two other types of joins that also have specific practical application to our problem here.

We will now look at semi_join and anti_join. These are called "filtering joins" because they use the second dataframe in the function to filter rows from the first dataframe.



If you want to see *only* voters that have voted, use "semi_join". A semi_join is like an inner_join, but instead of giving you one row from the first dataframe EVERY time it matches a row in the second dataframe, this gives you only unique values from the first dataframe. And it only gives you columns from the from the first dataframe. It is just *filtering* the first dataframe, not really joining it with the second.

```{r}
semi_join(voters_yancey, vhis_yancey, by="ncid")
```

Note that the number of columns in this result are equal to the number of columns in voters_yancey.



If you want to see only voters who have NEVER voted, we use anti_join. This gives us only rows from the first dataframe that have NO matches in the second dataframe.

```{r}
anti_join(voters_yancey, vhis_yancey, by="ncid")
```


So now we've covered the main ways you would join dataframes side by side. In the next lesson we will look at different ways to put dataframes on top of each other.
